Takes on "Alignment Faking in Large Language Models"

Updated: 2024-12-18
Description

What can we learn from recent empirical demonstrations of scheming in frontier models? Text version here: https://joecarlsmith.com/2024/12/18/takes-on-alignment-faking-in-large-language-models/

Joe Carlsmith